Introduction

This report is the first to document and study the feasibility of automatically evaluating the quality of experimental literature investigating bio–nano interactions. The first step of this automatic evaluation is to isolate the Materials and Methods section. The goal is to later use this section alone to assess whether the nanomaterials have been characterised and to evaluate the quality of the articles.

This report contains preliminary analyses and an exploration of the data in the corpus of texts. The first goal of these analyses is to gain some understanding of the structure of the texts in the corpus and of how the lemmas “material(s)” and “method(s)” relate to this corpus.

The second goal is to investigate how to identify the beginning of the “Materials and Methods” section. The main difficulty is that these two words can also appear elsewhere in the text of the article (typically in a cross-reference such as “cf. Materials and Methods”).

The corpus of text was built from the folder “Full Text dev set”, which contains 751 articles converted to txt format. The other articles are kept unseen so that any tools developed later can be tested under “real life” conditions.

A few definitions to frame the problem:

A quick exploratory data analysis of the article Abrams, MT et al., 2010 suggested that the “materials” token of the Materials and Methods section title has a specific property: its head_token_id is equal to zero, i.e. the token is the root of its sentence and has no head other than itself (see the example below). This suggested that the section titles of articles may share this property. This hypothesis is tested in the first part of this report and, in a later section, for the lemmas “materials” and “material” (Co-occurrences for materials and material when their head_token_id = 0).

In the later sections, we try different criteria to isolate particular occurrences of the lemmas “materials”, “material”, “methods” and “method”. We use co-occurrence analysis to explore the surroundings of these lemmas in the text and to evaluate whether each criterion discriminates the beginning of the Materials and Methods section from the rest of the article.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It is a good way to create informal reports describing data analysis projects as a web page, and to mix code and description in a readable manner. There are even books written in this format, ranging from Data Analysis for the Life Sciences to Text Mining with R: A Tidy Approach, so anybody can understand and reuse this work. This report is also code: it can be recompiled with new data (including another model for the annotation of the corpus).

Import and data structure

library(udpipe)
library(lattice)
library(wordcloud)
library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)

The following lines load the corpus of text, already annotated and tokenized :

x <- readRDS(file = "annotation.rds")
x <- as.data.frame(x)
length(unique(x$doc_id))
## [1] 751

Here is an example of a token “materials” with head_token_id = 0:

x[7467,]
##      doc_id paragraph_id sentence_id                       sentence
## 7467   doc1          599         830 Materials and Methods Animals.
##      token_id     token     lemma upos xpos       feats head_token_id
## 7467        1 Materials materials NOUN  NNS Number=Plur             0
##      dep_rel deps misc
## 7467    root <NA> <NA>

Words with head_token_id == 0

Given the observation that, in “Materials and Methods”, the head_token_id of the token “Materials” was 0, one idea was to explore which lemmas in the corpus most commonly have a head_token_id equal to zero.

The expected outcome of this analysis was to find the usual section titles of scientific articles, such as Abstract or Results, among the most common lemmas. The goal is to assess whether this is a consistent property of section titles in the articles, and to uncover potential synonyms of “Materials and Methods” such as “Experimental section”.

stats <- subset(x, head_token_id == 0) #https://bnosac.github.io/udpipe/docs/doc7.html
stats <- txt_freq(x = stats$lemma)

stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring words with Head_token_id = 0", xlab = "Freq")

However, this assumption turned out to be quite naive, as many tokens have this property. Let’s filter for specific lemmas that correspond to usual section titles, such as abstract or results:

stats<-stats %>% filter(key %in% c("material", "materials", "result", "results", "abstract", "introduction" , "method", "methods", "discussion", "references"))

stats$key <- factor(stats$key, levels = rev(stats$key))
barchart(key ~ freq, data = head(stats, 30), col = "cadetblue", main = "Count of lemma for usual sections name with Head_token_id = 0", xlab = "Freq")

stats
##             key freq     freq_pct
## 1        result 1829 0.3668933422
## 2        method  534 0.1071192153
## 3     materials  449 0.0900684038
## 4    discussion  376 0.0754247658
## 5  introduction  268 0.0537602054
## 6      material  132 0.0264789071
## 7       methods   82 0.0164490181
## 8      abstract   19 0.0038113578
## 9       results    8 0.0016047823
## 10   references    2 0.0004011956

Some section titles do seem to have the aforementioned property. Nonetheless, the counts do not match the total number of articles in the corpus (751). Taking the lemma “discussion” as an example: either some articles do not have a Discussion section or, more probably, the token “discussion” does not always have the property mentioned earlier. We can check this:

occurrences<-which(x$lemma=="discussion")
length(occurrences)
## [1] 891
length(unique(x[occurrences,]$doc_id))
## [1] 703

There are 891 occurrences of the lemma “discussion” in the whole corpus, spread over 703 articles, whereas only 376 occurrences have a head_token_id of zero. It therefore seems very likely that a head_token_id of zero alone is not sufficient to identify the tokens that are section titles.
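
To make this check per article rather than per token, one can compare the set of articles that contain the lemma with the set of articles in which at least one occurrence is a root. The following is a minimal sketch on a hypothetical toy table; toy, docs_with_lemma, docs_with_root and coverage are illustrative names that are not part of the original analysis, and the table is only assumed to have the same doc_id, lemma and head_token_id columns as x.

```r
# Hypothetical toy annotation table with the same column names as x
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc2", "doc2", "doc3"),
  lemma         = c("discussion", "discussion", "discussion", "result", "discussion"),
  head_token_id = c("0", "4", "2", "0", "0"),
  stringsAsFactors = FALSE
)

is_lemma <- toy$lemma == "discussion"
is_root  <- as.integer(toy$head_token_id) == 0L

# Articles that contain the lemma at all
docs_with_lemma <- unique(toy$doc_id[is_lemma])
# Articles in which at least one occurrence of the lemma is a root token
docs_with_root <- unique(toy$doc_id[is_lemma & is_root])

# Share of articles containing the lemma where the root criterion fires at least once
coverage <- length(docs_with_root) / length(docs_with_lemma)
```

Applied to x, a coverage well below 1 would indicate that many articles never have “discussion” as a root token, which would support the conclusion above.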

Visualize the most frequent head tokens of the lemmas material, materials, method and methods

To explore how the lemmas “material(s)” and “method(s)” relate to the rest of the corpus, we can analyse which head tokens are most frequent for the lemmas “material” and “materials”. The goals of the analysis are:

  • to observe whether the lemma “material(s)” often has the lemma “material(s)” as its head, and how frequently
  • to observe which other lemmas are commonly the head of the lemma “material(s)”
  • to ask the same questions for the lemmas “method” and “methods”

Lemma material

grep_lemma_head_token_id <- function(index){
  #catch the lemma corresponding to the head_token_id of the token at the entry "index" of x
  #x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
  occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
  head_token_id<-occurrence$head_token_id
  head_token_id<-as.numeric(head_token_id)
  sentence_id<-occurrence$sentence_id
  doc_id<-occurrence$doc_id
  #the following line query the lemma of the head_token_id based on the previous parameters
  lemma_head_token_id<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[head_token_id],]$lemma
  if (head_token_id==0) {lemma_head_token_id=occurrence$lemma}
  return(lemma_head_token_id)
}

material_occurrences<-which(x$lemma=="material")
head_token_lemmas<-sapply(material_occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 

stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma material", xlab = "Freq")
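
The lookup performed by grep_lemma_head_token_id can also be sketched in vectorised form, which avoids scanning x once per occurrence. This is only an illustration on hypothetical toy data: it assumes that token_id and head_token_id identify tokens within a sentence, as in the udpipe output shown earlier, and the names toy, key_token, key_head and head_lemma are made up for the example.

```r
# Hypothetical toy annotation: two short "sentences" from two documents
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id   = c(1, 1, 1, 1, 1, 1),
  token_id      = c("1", "2", "3", "4", "1", "2"),
  lemma         = c("see", "materials", "and", "methods", "materials", "use"),
  head_token_id = c("0", "1", "4", "2", "0", "1"),
  stringsAsFactors = FALSE
)

# One key per token, and one key pointing at each token's head
key_token <- paste(toy$doc_id, toy$sentence_id, toy$token_id, sep = "|")
key_head  <- paste(toy$doc_id, toy$sentence_id, toy$head_token_id, sep = "|")

# Look up the lemma of the head token; roots (head 0) find no match and give NA
head_lemma <- toy$lemma[match(key_head, key_token)]

# Same convention as grep_lemma_head_token_id: a root token is its own head
is_root <- toy$head_token_id == "0"
head_lemma[is_root] <- toy$lemma[is_root]

# Head lemmas for the occurrences of "materials" only
table(head_lemma[toy$lemma == "materials"])
```

Matching on token_id rather than on the position within the sentence also stays correct if a sentence contains tokens whose ids are not simply 1, 2, 3, and so on.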

Lemma materials

occurrences<-which(x$lemma=="materials") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 

stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma materials", xlab = "Freq")

Lemma method

occurrences<-which(x$lemma=="method") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 


stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma method", xlab = "Freq")

Lemma methods

occurrences<-which(x$lemma=="methods") 

head_token_lemmas<-sapply(occurrences, grep_lemma_head_token_id)
mytable<-table(head_token_lemmas)
stats<-as.data.frame(mytable) 


stats<-stats[order(stats$Freq, decreasing = TRUE),]
stats$key <- factor(stats$head_token_lemmas, levels = rev(stats$head_token_lemmas))
barchart(key ~ Freq, data = head(stats, 30), col = "cadetblue", main = "Most occurring lemma corresponding to the head_token_id \n for lemma methods", xlab = "Freq")

head(stats, 10)
##     head_token_lemmas Freq       key
## 88          materials  108 materials
## 92            methods   83   methods
## 44           describe   39  describe
## 79                  j   29         j
## 95                Mol   20       Mol
## 61            Enzymol   14   Enzymol
## 87           material   11  material
## 126           section   11   section
## 139         synthesis   10 synthesis
## 96               Mol.    9      Mol.

Co-occurrences

Co-occurrences for material(s) and method(s)

In the next sections we test different criteria to discriminate occurrences of the lemmas “materials” and “material” inside the articles. The idea is to find a criterion that identifies the beginning of the “Materials and Methods” section.

Co-occurrence analysis shows how words are used either in the same sentence or next to each other. We use this approach to get a sense of the neighbourhood of the lemmas isolated by each criterion.

There are several types of co-occurrence analysis:

  • looking at which words are located in the same document, sentence or paragraph
  • looking at which words are followed by another word
  • looking at which words are in the neighbourhood of a word, i.e. follow it within a given skipgram number of words

See the documentation of the udpipe package for the three possible uses. We use the second approach, as it is the most relevant to our goal and the simplest to interpret. Different skipgram values can be used to look at a wider or a more proximal neighbourhood.
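
To make the “followed by” notion concrete, the counting idea can be sketched in base R on a hypothetical vector of lemmas. This only illustrates what is being counted; it is not the implementation used by udpipe, and the names lemmas, pairs and cooc are made up for the example.

```r
# Hypothetical sequence of lemmas, in reading order
lemmas <- c("materials", "and", "methods", ".", "materials", "and", "reagent")

# Every lemma paired with the lemma that directly follows it
pairs <- data.frame(
  term1 = head(lemmas, -1),
  term2 = tail(lemmas, -1),
  stringsAsFactors = FALSE
)

# Count how many times each ordered pair occurs, most frequent first
cooc <- aggregate(cooc ~ term1 + term2, data = transform(pairs, cooc = 1L), FUN = sum)
cooc <- cooc[order(-cooc$cooc), ]
cooc
```

Because the lemmas are treated as one continuous sequence, a pair can span a sentence boundary, which is why a full stop can appear as the lemma preceding “materials”.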

The two functions below are meant to save space in the document. The first one filters the co-occurrences down to those involving the lemma of interest, then plots the word network, a common technique for visualising word co-occurrences. The second one returns the first 30 rows of the same filtered table.

plot_cooccurrence <- function(stats, lemma, title){
  #helper to save space and make this R Markdown document clearer
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  wordnetwork <- head(stats, 30)
  wordnetwork <- graph_from_data_frame(wordnetwork)
  ggraph(wordnetwork, layout = "fr") +
    geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
    geom_node_text(aes(label = name), col = "blue", size = 5) +
    theme_graph(base_family = "Helvetica") +
    theme(legend.position = "none") +
    labs(title = title)
}
head_cooc <- function(stats, lemma){
  #helper to save space and make this R Markdown document clearer
  stats <- stats %>% filter(term1 %in% c(lemma) | term2 %in% c(lemma))
  head(stats, 30)
}
stats <- cooccurrence(x = x$lemma, skipgram = 0)

Larger skipgram values were not really relevant. With skipgram = 0, each row of the data frame stats simply counts how many times one lemma directly follows another.

plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials")

head_cooc(stats, lemma="materials")
##            term1     term2 cooc
## 1      materials       and  591
## 2              . materials  513
## 3      materials         ,   91
## 4             of materials   80
## 5      materials         .   78
## 6      materials        be   71
## 7      materials         &   65
## 8      materials  research   59
## 9        applied materials   58
## 10         these materials   46
## 11     materials   Science   43
## 12    Biomedical materials   37
## 13       methods materials   35
## 14           the materials   32
## 15    BIOMEDICAL materials   30
## 16     Amorphous materials   28
## 17           and materials   23
## 18     materials      have   22
## 19        method materials   22
## 20     materials   science   19
## 21     materials Chemistry   19
## 22           see materials   19
## 23     materials        in   18
## 24   Proteineous materials   17
## 25            in materials   16
## 26     materials       for   15
## 27 Supplementary materials   15
## 28      advanced materials   15
## 29     materials         (   14
## 30     Hazardous materials   13
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material")

head_cooc(stats, lemma="material")
##            term1    term2 cooc
## 1       material        .  376
## 2       material        ,  305
## 3       material       be  278
## 4       material      and  216
## 5            the material  214
## 6             of material  209
## 7           test material  195
## 8       material        (  141
## 9       material       in  137
## 10      material       at  130
## 11 supplementary material  106
## 12 Supplementary material   94
## 13         these material   89
## 14          this material   74
## 15          bulk material   73
## 16      material      for   67
## 17      material     that   60
## 18           and material   57
## 19      nanotube material   56
## 20       foreign material   51
## 21     reference material   48
## 22      material     with   47
## 23      material     have   44
## 24      material        :   40
## 25      material       on   39
## 26          size material   38
## 27             . material   33
## 28      material        [   32
## 29      material      the   32
## 30      material       to   32
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods")

head_cooc(stats, lemma="methods")
##          term1         term2 cooc
## 1          and       methods  174
## 2            .       methods  138
## 3      methods             .   71
## 4           in       methods   42
## 5      methods           for   39
## 6      methods     materials   35
## 7      methods           Mol   28
## 8      Immunol       methods   28
## 9      methods       Enzymol   28
## 10        Mech       methods   21
## 11     methods             )   20
## 12     methods             ,   18
## 13     methods          Mol.   17
## 14           ,       methods   17
## 15     methods             (   16
## 16     methods           and   13
## 17     methods     synthesis   11
## 18 alternative       methods   11
## 19     methods       Animals   10
## 20     methods       section   10
## 21     methods     chemicals    9
## 22  analytical       methods    9
## 23     methods            in    7
## 24         see       methods    7
## 25      revise       methods    6
## 26     methods      Enzymol.    6
## 27         use       methods    6
## 28     methods Nanoparticles    5
## 29     methods            to    5
## 30        Test       methods    5
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method")

head_cooc(stats, lemma="method")
##         term1    term2 cooc
## 1         and   method  506
## 2      method        .  491
## 3         the   method  449
## 4      method      for  448
## 5      method       of  315
## 6           .   method  293
## 7      method       be  279
## 8      method        ,  269
## 9      method       to  225
## 10     method        (  203
## 11     method      2.1  139
## 12     method      and  130
## 13     method      use  128
## 14     method describe  126
## 15       this   method  119
## 16     method        :  117
## 17          )   method   99
## 18          a   method   94
## 19       test   method   81
## 20     method        [   80
## 21     method       in   77
## 22     method     have   60
## 23     method       as   52
## 24     method        )   46
## 25  sensitive   method   46
## 26     method     with   44
## 27      vitro   method   39
## 28 analytical   method   37
## 29     method     that   35
## 30          :   method   35

Co-occurrences, visualization of all the lemmas of interest

plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##        term1     term2 cooc
## 1  materials       and  591
## 2          . materials  513
## 3        and    method  506
## 4     method         .  491
## 5        the    method  449
## 6     method       for  448
## 7   material         .  376
## 8     method        of  315
## 9   material         ,  305
## 10         .    method  293
## 11    method        be  279
## 12  material        be  278
## 13    method         ,  269
## 14    method        to  225
## 15  material       and  216
## 16       the  material  214
## 17        of  material  209
## 18    method         (  203
## 19      test  material  195
## 20       and   methods  174
## 21  material         (  141
## 22    method       2.1  139
## 23         .   methods  138
## 24  material        in  137
## 25    method       and  130
## 26  material        at  130
## 27    method       use  128
## 28    method  describe  126
## 29      this    method  119
## 30    method         :  117

Co-occurrences for materials and material when their head_token_id = 0

As in the previous approach, we want to explore how the different lemmas relate to their neighbourhood in the corpus, but we restrict the analysis to sentences in which the lemma “material” or “materials” is the root, i.e. has a head_token_id of zero.

Even if not every “Materials and Methods” section title has a “materials” lemma with a head_token_id equal to zero, the converse could still hold: a “materials” lemma with a head_token_id of zero could reliably indicate a section title.

Here, by restricting the analysis to the lemmas “materials” and “material” that have head_token_id = 0, we can visualize their statistical association with other words and see whether this subset of tokens really delimits the beginning of the “Materials and Methods” section.
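
As an illustration of this restriction, the sentences of interest can be selected in one pass by building a key per sentence and keeping every token whose sentence contains a root occurrence of the lemma. This is a sketch on hypothetical toy data, assuming the same doc_id, sentence_id, lemma and head_token_id columns as x; the names toy, sentence_key, is_hit and subset_toy are invented for the example.

```r
# Hypothetical toy annotation: sentence 1 is a section title, sentence 2 is running text
toy <- data.frame(
  doc_id        = c("doc1", "doc1", "doc1", "doc1", "doc1", "doc1"),
  sentence_id   = c(1, 1, 1, 2, 2, 2),
  lemma         = c("materials", "and", "methods", "the", "materials", "be"),
  head_token_id = c("0", "3", "1", "2", "3", "0"),
  stringsAsFactors = FALSE
)

# A sentence is identified by its document and its sentence number
sentence_key <- paste(toy$doc_id, toy$sentence_id, sep = "|")

# Tokens that are the lemma of interest and the root of their sentence
is_hit <- toy$lemma == "materials" & toy$head_token_id == "0"

# Keep all tokens of every sentence that contains such a token
subset_toy <- toy[sentence_key %in% unique(sentence_key[is_hit]), ]
subset_toy$lemma
```

The second sentence also contains “materials”, but not as a root, so it is left out of the subset.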

The function create_subset_corpus keeps the sentences in which the lemma “material” or “materials” is the root. The following lines compute the co-occurrences and draw the plot as before.

create_subset_corpus<- function(index){
  #this function is aimed to help construct a subset of x for the part of the analysis :
  #Co-occurences for materials and material when their head_token_id = 0
  #x[index,] return a token and all the associated data : lemma, but also sentence and doc_id
  occurrence<-x[index,] #x[index,], where x is the dataframe of annotation generated by udpipe
  sentence_id<-occurrence$sentence_id
  doc_id<-occurrence$doc_id
  #the following lines collect the head_token_id and test if is equal to zero
  #if so, its output the tokens of the sentences
  head_token_id<-occurrence$head_token_id
  if (head_token_id==0) {return(strip_corpus(doc_id, sentence_id))} 
  return()
}

strip_corpus <- function(doc_id, sentence_id){
  #this function returns all the lemma of a sentence, in the appropriate format
  #the purpose of doing so is to allow for calculation of cooccurence of words inside this sentences
  #for this we need all the elements of the sentence
  sentence_id<-as.numeric(sentence_id)
  subset_article<-x[which(x$sentence_id==sentence_id & x$doc_id==doc_id),]
  return(subset_article)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, whereas sapply may simplify it
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="materials")
##            term1     term2 cooc
## 1      materials       and  344
## 2              . materials  260
## 3         method materials   78
## 4      materials         .   28
## 5      materials materials   19
## 6        methods materials   15
## 7      materials         &    9
## 8      materials         2    8
## 9      materials       for    7
## 10    Mesoporous materials    4
## 11       applied materials    4
## 12     materials         5    4
## 13     materials        of    4
## 14   particulate materials    3
## 15             2 materials    3
## 16     materials         ,    3
## 17          test materials    3
## 18      advanced materials    3
## 19     materials         6    2
## 20     copolymer materials    2
## 21     materials    within    2
## 22 Supplementary materials    2
## 23    Biomedical materials    2
## 24             : materials    2
## 25  nanoparticle materials    2
## 26            of materials    2
## 27        nature materials    2
## 28 Antibacterial materials    2
## 29     materials         (    2
## 30     Amorphous materials    2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma materials is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##          term1       term2 cooc
## 1    materials         and  344
## 2          and      method  280
## 3            .   materials  260
## 4       method         2.1  114
## 5       method   materials   78
## 6          and     methods   49
## 7       method           2   36
## 8    materials           .   28
## 9    materials   materials   19
## 10     methods   materials   15
## 11      method      Animal   11
## 12   materials           &    9
## 13   materials           2    8
## 14   materials         for    7
## 15           &      method    6
## 16      method           :    6
## 17     methods     Animals    5
## 18      method preparation    4
## 19     methods   chemicals    4
## 20  Mesoporous   materials    4
## 21     methods         2.1    4
## 22     applied   materials    4
## 23   materials           5    4
## 24   materials          of    4
## 25 particulate   materials    3
## 26           2   materials    3
## 27     methods           2    3
## 28   materials           ,    3
## 29        test   materials    3
## 30    advanced   materials    3
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, whereas sapply may simplify it
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="material")
##            term1         term2 cooc
## 1       material           and   20
## 2       material             .   19
## 3  supplementary      material   14
## 4  Supplementary      material   14
## 5              .      material   13
## 6       material     available   12
## 7           test      material   10
## 8       material             ,    8
## 9        section      material    7
## 10      material          with    6
## 11    Copyrighte      material    6
## 12      material Supplementary    4
## 13        method      material    4
## 14      material      material    4
## 15      material          that    4
## 16      material            in    4
## 17     reference      material    4
## 18    mesoporous      material    4
## 19      material          from    3
## 20      material           for    3
## 21      material             2    3
## 22     composite      material    3
## 23      material            as    3
## 24     important      material    3
## 25      material             (    2
## 26      material            on    2
## 27      material    Electronic    2
## 28      material  experimental    2
## 29         oxide      material    2
## 30        result      material    2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma material is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1         term2 cooc
## 1       material           and   20
## 2       material             .   19
## 3  supplementary      material   14
## 4  Supplementary      material   14
## 5            and        method   14
## 6              .      material   13
## 7       material     available   12
## 8           test      material   10
## 9       material             ,    8
## 10       section      material    7
## 11        method           2.1    6
## 12      material          with    6
## 13    Copyrighte      material    6
## 14           and       methods    5
## 15      material Supplementary    4
## 16     materials           and    4
## 17        method      material    4
## 18      material      material    4
## 19      material          that    4
## 20      material            in    4
## 21     reference      material    4
## 22    mesoporous      material    4
## 23      material          from    3
## 24      material           for    3
## 25      material             2    3
## 26     composite      material    3
## 27      material            as    3
## 28     important      material    3
## 29      material             (    2
## 30             .     materials    2

Co-occurrences for methods and method when their head_token_id = 0

occurrences<-which(x$lemma=="methods")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, whereas sapply may simplify it
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="methods", title="Co-occurrences for the lemma methods \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="methods")
##        term1      term2 cooc
## 1          .    methods   32
## 2    methods    Enzymol   14
## 3    methods        for   11
## 4    methods          .    8
## 5       Mech    methods    8
## 6    Immunol    methods    7
## 7    methods        Mol    6
## 8     revise    methods    5
## 9          [    methods    5
## 10   methods          ]    5
## 11    assess    methods    3
## 12   methods       Phys    3
## 13   methods   Enzymol.    3
## 14   methods        and    3
## 15   methods       Mol.    2
## 16   methods         in    2
## 17   methods Production    1
## 18   methods          (    1
## 19   methods     Enzym.    1
## 20    Enzym.    methods    1
## 21   methods     109:55    1
## 22  Enzymol.    methods    1
## 23 materials    methods    1
## 24       Nat    methods    1
## 25   methods 2008;5:763    1
## 26  Standard    methods    1
## 27      USA.    methods    1
## 28   methods          9    1
## 29   methods   2010;20(    1
## 30   methods    General    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma methods is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##        term1      term2 cooc
## 1          .    methods   32
## 2    methods    Enzymol   14
## 3    methods        for   11
## 4    methods          .    8
## 5       Mech    methods    8
## 6    Immunol    methods    7
## 7    methods        Mol    6
## 8     revise    methods    5
## 9          [    methods    5
## 10   methods          ]    5
## 11    assess    methods    3
## 12   methods       Phys    3
## 13   methods   Enzymol.    3
## 14   methods        and    3
## 15   methods       Mol.    2
## 16   methods         in    2
## 17   methods Production    1
## 18   methods          (    1
## 19   methods     Enzym.    1
## 20    Enzym.    methods    1
## 21   methods     109:55    1
## 22  Enzymol.    methods    1
## 23       and  materials    1
## 24 materials    methods    1
## 25       Nat    methods    1
## 26   methods 2008;5:763    1
## 27  Standard    methods    1
## 28      USA.    methods    1
## 29   methods          9    1
## 30   methods   2010;20(    1
occurrences<-which(x$lemma=="method")
subset_corpus<-lapply(occurrences, function(index) create_subset_corpus(index)) #lapply always returns a list, whereas sapply may simplify it
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="method", title="Co-occurrences for the lemma method \n when its head_token_id is equal to 0")

head_cooc(stats, lemma="method")
##          term1  term2 cooc
## 1            . method  168
## 2       method    for  108
## 3       method      :   79
## 4            : method   45
## 5       method      .   44
## 6       method     to   42
## 7       method method   29
## 8       method     of   26
## 9       method     in   16
## 10      method      ,   14
## 11      method    use   13
## 12           ) method   13
## 13         the method   13
## 14      method    and   13
## 15        easy method   10
## 16   sensitive method   10
## 17      method    2.1    9
## 18        test method    8
## 19           a method    8
## 20      method      (    8
## 21      simple method    7
## 22    standard method    6
## 23      method    the    6
## 24 Statistical method    6
## 25       vitro method    6
## 26      method   that    5
## 27    reliable method    5
## 28      method      a    4
## 29   efficient method    4
## 30         new method    4
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas : materials, material, methods, method, \n when head_token_id of lemma method is equal to 0")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##          term1  term2 cooc
## 1            . method  168
## 2       method    for  108
## 3       method      :   79
## 4            : method   45
## 5       method      .   44
## 6       method     to   42
## 7       method method   29
## 8       method     of   26
## 9       method     in   16
## 10      method      ,   14
## 11      method    use   13
## 12           ) method   13
## 13         the method   13
## 14      method    and   13
## 15        easy method   10
## 16   sensitive method   10
## 17      method    2.1    9
## 18        test method    8
## 19           a method    8
## 20      method      (    8
## 21      simple method    7
## 22    standard method    6
## 23      method    the    6
## 24 Statistical method    6
## 25       vitro method    6
## 26      method   that    5
## 27    reliable method    5
## 28      method      a    4
## 29   efficient method    4
## 30         new method    4

Co-occurrences for materials and material when it is the last occurrence in the document

We could assume that the last occurrence of the lemma “materials” in an article corresponds to the section title “Materials and Methods”. As before, we use co-occurrences to see which words are connected to the last occurrence of “materials” in each document, and how often it corresponds to a “Materials and Methods” section.
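
The selection of the last occurrence per document can be illustrated in a few lines of base R. This is a minimal sketch on hypothetical toy data, assuming that the rows are in reading order within each document, as they are in the udpipe output; the names toy, hits and last_hits are made up for the example.

```r
# Hypothetical toy annotation, rows in reading order
toy <- data.frame(
  doc_id      = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  sentence_id = c(3, 10, 10, 5, 8),
  lemma       = c("materials", "materials", "and", "materials", "method"),
  stringsAsFactors = FALSE
)

# Row indices of every occurrence of the lemma
hits <- which(toy$lemma == "materials")

# Keep, for each document, only the last of those rows:
# scanning from the end, the first row seen for a document is not a duplicate
last_hits <- hits[!duplicated(toy$doc_id[hits], fromLast = TRUE)]

# Document and sentence of each last occurrence
toy[last_hits, c("doc_id", "sentence_id")]
```

The sentences identified this way are the ones whose tokens would then be passed to the co-occurrence computation.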

The function below, together with strip_corpus, selects the last occurrence of a lemma in each document and retrieves the sentence that contains it, using the document and sentence ids. A graph showing the connections between the words of this subset of sentences is then plotted.

create_subset_corpus_last_lemmas <- function(index){
  # Helper for the analysis "Co-occurrences for materials and material when it is
  # the last occurrence of the lemma in the document".
  # x is the annotation data frame generated by udpipe; x[index,] returns one token
  # with its associated data: lemma, sentence_id and doc_id.
  occurrence <- x[index,]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  lemma <- occurrence$lemma
  # row indices of every occurrence of this lemma in the same document
  occurrences_in_doc <- which(x$doc_id==doc_id & x$lemma==lemma)
  last_occurrence <- occurrences_in_doc[length(occurrences_in_doc)]
  # keep the sentence only if this token is the last occurrence in its document
  if (last_occurrence==index){return(strip_corpus(doc_id, sentence_id))}
  return(NULL)
}
occurrences<-which(x$lemma=="materials")
# lapply always returns a list (sapply may simplify); rbind ignores the NULL entries
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas)
subset_corpus<-do.call(rbind, subset_corpus)
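
The same selection can also be expressed without looping over every occurrence. The sketch below is a minimal base R illustration on a toy annotation table. The data frame `toy` and the helper `last_occurrence_rows` are hypothetical names introduced for this example and are not part of the analysis. The sketch assumes that rows are ordered by document and by position in the text, as we expect from the udpipe output.

```r
# Toy annotation table with the columns used above (values are invented)
toy <- data.frame(
  doc_id      = c("a", "a", "a", "a", "b", "b", "b"),
  sentence_id = c(1, 1, 2, 3, 1, 2, 2),
  lemma       = c("materials", "and", "materials", "method",
                  "materials", "the", "method"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row index of the last occurrence of `target` in each document
last_occurrence_rows <- function(annotation, target) {
  rows <- which(annotation$lemma == target)
  # tapply groups the row indices by document; max() keeps the last one
  as.integer(tapply(rows, annotation$doc_id[rows], max))
}

last_rows <- last_occurrence_rows(toy, "materials")
# document and sentence ids of the sentences to keep
toy[last_rows, c("doc_id", "sentence_id")]
```

The resulting pairs of doc_id and sentence_id could then be passed to strip_corpus in the same way as in the loop above.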

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for the lemma materials \n when it is the last occurrence of the lemma in the document")

head_cooc(stats, lemma="materials")
##            term1     term2 cooc
## 1      materials       and  306
## 2              . materials  242
## 3      materials         .   48
## 4         method materials   37
## 5             of materials   24
## 6      materials         ,   23
## 7      materials        be   22
## 8        methods materials   20
## 9      materials materials   14
## 10     materials         &   13
## 11     materials  research   12
## 12           and materials   12
## 13     materials       for   11
## 14     Amorphous materials   11
## 15     materials        in   10
## 16     materials   Science   10
## 17     materials   science    9
## 18           the materials    9
## 19 Supplementary materials    9
## 20         these materials    8
## 21     nanosized materials    8
## 22       applied materials    7
## 23    BIOMEDICAL materials    7
## 24     materials      Inc.    7
## 25         other materials    6
## 26     nanoscale materials    6
## 27     materials        to    6
## 28    Biomedical materials    5
## 29           all materials    5
## 30           see materials    5
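
To help read the table above, the sketch below shows on a toy vector what a co-occurrence count with skipgram = 0 amounts to, as we understand it: each row counts how often term1 is immediately followed by term2. It is written in base R with a hypothetical helper `adjacent_pairs` and is not the udpipe implementation. It also assumes that the lemma vector is read as one continuous sequence, so that the final token of one selected sentence is paired with the first token of the next one. Under that assumption, pairs such as “. materials” may partly reflect sentence boundaries.

```r
# Hypothetical helper: count pairs of immediately adjacent terms in a vector
adjacent_pairs <- function(terms) {
  stopifnot(length(terms) >= 2)
  n <- length(terms)
  pairs <- data.frame(term1 = terms[-n], term2 = terms[-1],
                      stringsAsFactors = FALSE)
  pairs$cooc <- 1L
  # sum the occurrences of each (term1, term2) pair
  counts <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  # most frequent pairs first, as in the tables of this report
  counts <- counts[order(-counts$cooc), ]
  rownames(counts) <- NULL
  counts
}

# Two toy "sentences" placed one after the other (invented tokens)
toy_lemmas <- c("materials", "and", "method", ".",
                "materials", "and", "method", "2.1")
toy_counts <- adjacent_pairs(toy_lemmas)
toy_counts
```

In this toy vector the pair “. materials” only exists because the second sentence follows the first one in the vector.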
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when materials is the last occurrence of the lemma in the document")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1            term2 cooc
## 1      materials              and  306
## 2              .        materials  242
## 3            and           method  210
## 4            and          methods   75
## 5         method              2.1   56
## 6      materials                .   48
## 7         method        materials   37
## 8             of        materials   24
## 9      materials                ,   23
## 10     materials               be   22
## 11       methods        materials   20
## 12     materials        materials   14
## 13     materials                &   13
## 14        method           Animal   12
## 15     materials         research   12
## 16           and        materials   12
## 17     materials              for   11
## 18        method                2   11
## 19     Amorphous        materials   11
## 20     materials               in   10
## 21     materials          Science   10
## 22        method        chemicals   10
## 23     materials          science    9
## 24           the        materials    9
## 25 Supplementary        materials    9
## 26         these        materials    8
## 27     nanosized        materials    8
## 28       methods                .    8
## 29        method Characterization    8
## 30       methods          Animals    7
occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus_last_lemmas)
subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for the lemma material \n when it is the last occurrence of the lemma in the document")

head_cooc(stats, lemma="material")
##            term1            term2 cooc
## 1             of         material   99
## 2       material                .   92
## 3       material               at   86
## 4       material               be   53
## 5       material                ,   53
## 6       material              and   48
## 7  Supplementary         material   43
## 8            the         material   39
## 9           this         material   29
## 10      nanotube         material   27
## 11      material               in   23
## 12      material                :   23
## 13      material                (   22
## 14      material        available   22
## 15          test         material   19
## 16 supplementary         material   19
## 17      material              for   17
## 18           and         material   17
## 19     reference         material   10
## 20             /         material   10
## 21      material             from    9
## 22            in         material    9
## 23      material                /    9
## 24       genetic         material    9
## 25         these         material    9
## 26      material characterization    8
## 27             a         material    8
## 28      material            refer    8
## 29      adequate         material    8
## 30      material               as    7
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when material is the last occurrence of the lemma in the document")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##            term1            term2 cooc
## 1             of         material   99
## 2       material                .   92
## 3       material               at   86
## 4       material               be   53
## 5       material                ,   53
## 6       material              and   48
## 7  Supplementary         material   43
## 8            the         material   39
## 9           this         material   29
## 10           and           method   28
## 11      nanotube         material   27
## 12      material               in   23
## 13      material                :   23
## 14      material                (   22
## 15      material        available   22
## 16          test         material   19
## 17 supplementary         material   19
## 18      material              for   17
## 19           and         material   17
## 20        method                ,   12
## 21             :           method   11
## 22     reference         material   10
## 23             /         material   10
## 24      material             from    9
## 25            in         material    9
## 26      material                /    9
## 27       genetic         material    9
## 28         these         material    9
## 29     materials              and    8
## 30      material characterization    8

Co-occurrences for the lemmas materials and material when they are the first lemma of a sentence

Materials

create_subset_corpus <- function(index, target){
  # Helper for the analysis "Co-occurrences for the lemmas materials and material
  # when they are the first lemma of a sentence".
  # x is the annotation data frame generated by udpipe; x[index,] returns one token
  # with its associated data: lemma, sentence_id and doc_id.
  occurrence <- x[index,]
  sentence_id <- occurrence$sentence_id
  doc_id <- occurrence$doc_id
  # first lemma of the sentence, looked up within the same document
  first_lemma <- x[which(x$sentence_id==sentence_id & x$doc_id==doc_id)[1],]$lemma
  # isTRUE() guards against a missing (NA) lemma, which would make if() fail
  if (isTRUE(first_lemma==target)) {return(strip_corpus(doc_id, sentence_id))}
  return(NULL)
}
occurrences<-which(x$lemma=="materials")
subset_corpus<-lapply(occurrences, create_subset_corpus, target="materials")

subset_corpus<-do.call(rbind, subset_corpus)
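
The first lemma of every sentence can also be computed once for the whole table rather than once per occurrence. The following is a minimal sketch on toy data, assuming that rows are in text order. The names `toy`, `first_lemma_of_sentence` and `kept` are hypothetical and only serve the illustration.

```r
# Toy annotation table with the columns used above (values are invented)
toy <- data.frame(
  doc_id      = c("a", "a", "a", "a", "b", "b", "b"),
  sentence_id = c(1, 1, 1, 2, 1, 1, 2),
  lemma       = c("materials", "and", "method", "the",
                  "see", "materials", "materials"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first lemma of the sentence each token belongs to
first_lemma_of_sentence <- function(annotation) {
  # one key per (document, sentence) pair
  key <- paste(annotation$doc_id, annotation$sentence_id, sep = "\t")
  # match() returns the position of the first row carrying each key
  annotation$lemma[match(key, key)]
}

toy$first_lemma <- first_lemma_of_sentence(toy)
# keep the sentences whose first lemma is the target, once each
kept <- unique(toy[toy$first_lemma == "materials", c("doc_id", "sentence_id")])
kept
```

Here unique() keeps each sentence once, which may matter when a sentence contains the target lemma more than once.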

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="materials", title="Co-occurrences for lemma materials when it is the first lemma of a sentence")

head_cooc(stats, lemma="materials")
##               term1     term2 cooc
## 1         materials       and  393
## 2                 . materials  264
## 3            method materials  124
## 4           methods materials   50
## 5         materials materials   30
## 6         materials         .   23
## 7         materials   Science   10
## 8                 : materials   10
## 9         materials         &    9
## 10        materials        be    7
## 11        Amorphous materials    6
## 12        materials   science    5
## 13         material materials    5
## 14        materials  Chitosan    5
## 15          Animals materials    5
## 16     nanoparticle materials    5
## 17        materials         5    5
## 18        materials         ,    5
## 19        materials       for    4
## 20        chemicals materials    4
## 21        materials       PTX    4
## 22        materials  Pristine    4
## 23              663 materials    4
## 24    Nanoparticles materials    4
## 25               of materials    4
## 26 Characterization materials    3
## 27                , materials    3
## 28                ) materials    3
## 29        materials         )    3
## 30           animal materials    2
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma materials is the first lemma of a sentence")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##           term1            term2 cooc
## 1     materials              and  393
## 2             .        materials  264
## 3           and           method  261
## 4        method        materials  124
## 5           and          methods  115
## 6       methods        materials   50
## 7     materials        materials   30
## 8     materials                .   23
## 9        method           Animal   21
## 10       method              2.1   13
## 11       method      preparation   11
## 12       method        chemicals   11
## 13       method         material   10
## 14    materials          Science   10
## 15            :        materials   10
## 16    materials                &    9
## 17       method Characterization    9
## 18      methods          Animals    8
## 19       method                :    8
## 20    materials               be    7
## 21       method        synthesis    6
## 22            &           method    6
## 23    Amorphous        materials    6
## 24    materials          science    5
## 25     material        materials    5
## 26      methods        chemicals    5
## 27    materials         Chitosan    5
## 28      Animals        materials    5
## 29 nanoparticle        materials    5
## 30       method          Reagent    5

Material

occurrences<-which(x$lemma=="material")
subset_corpus<-lapply(occurrences, create_subset_corpus, target="material")

subset_corpus<-do.call(rbind, subset_corpus)

stats <- cooccurrence(x = subset_corpus$lemma, skipgram = 0)
plot_cooccurrence(stats, lemma="material", title="Co-occurrences for lemma material when it is the first lemma of a sentence")

head_cooc(stats, lemma="material")
##               term1            term2 cooc
## 1          material              and   20
## 2                 .         material   17
## 3           methods         material    5
## 4          material               on    4
## 5            method         material    3
## 6              test         material    2
## 7          material             once    2
## 8          material        treatment    2
## 9          material         -induced    2
## 10         material                &    2
## 11             Test         material    2
## 12 Characterization         material    2
## 13        Organisms         material    1
## 14         material           supply    1
## 15 characterization         material    1
## 16         material characterization    1
## 17       validation         material    1
## 18         material      engineering    1
## 19          Reagent         material    1
## 20            study         material    1
## 21         material      composition    1
## 22          altered         material    1
## 23         material    investigation    1
## 24               be         material    1
## 25        component         material    1
## 26         material         material    1
## 27         material       Implanting    1
## 28       Implanting         material    1
## 29         material       deposition    1
## 30         material                ,    1
plot_cooccurrence(stats, lemma=c("materials", "material", "methods", "method"), title="Co-occurrences for several lemmas: materials, material, methods, method, \n when lemma material is the first lemma of a sentence")

head_cooc(stats, lemma=c("materials", "material", "methods", "method"))
##               term1            term2 cooc
## 1          material              and   20
## 2                 .         material   17
## 3               and          methods   11
## 4               and           method    6
## 5           methods         material    5
## 6          material               on    4
## 7            method         material    3
## 8              test         material    2
## 9          material             once    2
## 10           method           Animal    2
## 11         material        treatment    2
## 12         material         -induced    2
## 13         material                &    2
## 14                &           method    2
## 15          methods             Test    2
## 16             Test         material    2
## 17 Characterization         material    2
## 18          methods                .    1
## 19          methods           Silica    1
## 20        Organisms         material    1
## 21         material           supply    1
## 22 characterization         material    1
## 23         material characterization    1
## 24       validation         material    1
## 25         material      engineering    1
## 26          methods     Cytotoxicity    1
## 27          Reagent         material    1
## 28            study         material    1
## 29          methods      preparation    1
## 30                a           method    1
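
One possible way to compare the criteria explored in this report is to measure, for each of them, the share of selected sentences that also contain the lemma “method” or “methods”. The sketch below does this on a toy subset. The names `toy_subset` and `share_with_method` are hypothetical, and the values are invented for the illustration; they are not results from the corpus.

```r
# Toy subset of selected sentences (values are invented)
toy_subset <- data.frame(
  doc_id      = c("a", "a", "a", "b", "b", "c", "c", "c"),
  sentence_id = c(4, 4, 4, 9, 9, 2, 2, 2),
  lemma       = c("materials", "and", "method", "Supplementary", "materials",
                  "materials", "and", "methods"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of sentences containing at least one method lemma
share_with_method <- function(sentences, method_lemmas = c("method", "methods")) {
  # one key per (document, sentence) pair
  key <- paste(sentences$doc_id, sentences$sentence_id, sep = "\t")
  # TRUE for a sentence as soon as one of its tokens is a method lemma
  has_method <- tapply(sentences$lemma %in% method_lemmas, key, any)
  mean(has_method)
}

share <- share_with_method(toy_subset)
share
```

Applied to each subset_corpus built above, such a share would give a single number per criterion, which is easier to compare than the full co-occurrence tables.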

Conclusion